3  Remove Batch Effect

Note
  • The output is a data frame of counts that has been back-transformed from the \(\log_{2}\) scale, with the values kept as integers.

  • Users do not need to set parameters such as auto_mode and default_input. When they are omitted, the function runs interactively and users choose which sample types to keep. Keeping the well-represented types and filtering out the sparse ones prevents noise from a handful of samples from distorting the batch correction.

  • Please note: the examples in this chapter only illustrate the workflow on public data. In real applications, the batches and groups may differ. The batch correction shown in this chapter rests on the broad assumption that different tumor types correspond to different batches. Users should replace this with the batch definition that fits their own data.

  • If the database lacks the target sample type, users can input “skip” in the default_input parameter of the Combat_Normal function.
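
The first note can be pictured with a small base-R sketch. It assumes the stored values are log2(count + 1), a common convention for public count matrices that this chapter does not state explicitly, and all object names are hypothetical. It is an illustration of the idea, not the package's internal code.

```r
# Hypothetical log2-scale expression values (genes x samples).
# ASSUMPTION: values were stored as log2(count + 1); the chapter does not
# state the exact transform, so adjust the inverse if yours differs.
log_expr <- data.frame(
  sample_A = c(log2(5707 + 1), log2(0 + 1), log2(2201 + 1)),
  sample_B = c(log2(1447 + 1), log2(2 + 1), log2(1296 + 1)),
  row.names = c("TSPAN6", "TNMD", "DPM1")
)

# Invert the transform and round so that every entry is a whole number.
counts <- round(2^log_expr - 1)

# Store as integers; any negative rounding artefact is clipped to zero.
counts[] <- lapply(counts, function(x) as.integer(pmax(x, 0)))

str(counts)
```

Integer counts matter here because count-based methods such as ComBat_seq model the data with a negative binomial distribution.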

Tip
  • The ComBat_seq function called internally requires its input as a matrix, not a data frame. Users who modify the function settings should keep this in mind.

  • The function converts the result back to a dataframe for the output, making it convenient for users’ subsequent data analysis.

  • The batch and group parameters inside the function are vectors. Users can set these vectors directly, but each vector must follow the same order as the samples (columns) of the count data.

  • In TCGA sample barcodes, codes 01 to 09 denote the different types of tumor samples, and codes 10 to 19 denote the different types of normal samples.

  • 01 (primary solid tumors) and 11 (normal solid tissues) are the most common, while 06 represents metastasis.

  • It is generally recommended to select 01, 06, and 11 during the interactive prompt. Whether to include other codes depends on the data. Types with very few samples tend to contribute more noise than signal, are unsuitable for batch correction, and are therefore not recommended.
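
The tips above can be sketched in base R. The example below reads the sample type code from characters 14 and 15 of each TCGA barcode, keeps only the chosen types, and builds a batch vector in column order before converting to a matrix. The toy data, the made-up "07" barcode, and the object names are assumptions for illustration; this is not the code inside combat_tumor.

```r
# Toy count table (genes x samples). In a TCGA barcode, characters 14-15
# hold the sample type code (01, 06, 11, ...).
barcodes <- c("TCGA-D9-A4Z2-01A", "TCGA-ER-A2NH-06A", "TCGA-BF-A5EO-01A",
              "TCGA-D9-A6EA-06A", "TCGA-XX-0000-07A")  # last barcode is made up
count_df <- as.data.frame(
  matrix(1:15, nrow = 3,
         dimnames = list(c("TSPAN6", "TNMD", "DPM1"), barcodes))
)

# Tabulate the sample types, similar to the table printed by the function.
sample_type <- substr(colnames(count_df), 14, 15)
print(table(sample_type))

# Keep only the well-represented types (chosen by hand here, as one would
# at the interactive prompt); the lone "07" sample is dropped.
keep_types <- c("01", "06")
count_df <- count_df[, sample_type %in% keep_types, drop = FALSE]

# Derive the batch vector from the filtered columns so that its order
# matches the sample order by construction.
batch <- substr(colnames(count_df), 14, 15)

# ComBat_seq expects a matrix, so convert before calling it.
count_mat <- as.matrix(count_df)
stopifnot(length(batch) == ncol(count_mat))
```

Deriving `batch` from the column names after filtering, rather than typing it by hand, is one simple way to avoid a mismatch between the vector and the sample order.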

3.1 Different tumor types in TCGA

Please note: The TumorHistologicalTypes/NormalHistologicalTypes classifications are specific to the data in the TCGA database.

3.1.1 SKCM_combat_count

SKCM_combat_count <- combat_tumor(
    tumor_data_path = "../test_TransProR/generated_data1/SKCM_Skin_TCGA_exp_tumor.rds", 
    CombatTumor_output_path = "../test_TransProR/generated_data1/removebatch_SKCM_Skin_TCGA_exp_tumor.rds",
    auto_mode = TRUE,
    default_input = "01,06"
)
TumorHistologicalTypes
 01  06  07 
103 367   1 
Found 2 batches
Using null model in ComBat-seq.
Adjusting for 0 covariate(s) or covariate level(s)
Estimating dispersions
Fitting the GLM model
Shrinkage off - using GLM estimates for parameters
Adjusting the data
head(SKCM_combat_count)[1:5, 1:5]
         TCGA-D9-A4Z2-01A TCGA-ER-A2NH-06A TCGA-BF-A5EO-01A TCGA-D9-A6EA-06A
TSPAN6               5707             1447             1049             1636
TNMD                    0                2                0                2
DPM1                 2201             1296              624             2482
SCYL3                1261              524              307              714
C1orf112             1107              373              295              564
         TCGA-D9-A4Z3-01A
TSPAN6               1511
TNMD                    0
DPM1                 1180
SCYL3                 522
C1orf112              557
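
The log lines above ("Found 2 batches", "Using null model in ComBat-seq.", and so on) are the messages that ComBat_seq from the sva package prints when it receives a batch vector and no group. A minimal sketch of such a call on simulated counts follows. The data and object names are made up, the call only runs if sva is installed, and it is not meant to reproduce the internals of combat_tumor.

```r
# Simulated counts: 50 genes x 8 samples, two batches of four samples each.
set.seed(1)
count_mat <- matrix(
  rnbinom(400, size = 10, prob = 0.1), nrow = 50, ncol = 8,
  dimnames = list(paste0("gene", 1:50), paste0("sample", 1:8))
)
batch <- rep(c("01", "06"), each = 4)  # one label per column, in column order

adjusted <- NULL
if (requireNamespace("sva", quietly = TRUE)) {
  # group = NULL requests the null model (no biological covariate).
  adjusted <- sva::ComBat_seq(count_mat, batch = batch, group = NULL)
  # Convert the adjusted matrix to a data frame for downstream analysis.
  adjusted <- as.data.frame(adjusted)
  print(adjusted[1:5, 1:5])
} else {
  message("Package 'sva' is not installed; skipping the ComBat_seq call.")
}
```

If a biological grouping should be preserved during correction, ComBat_seq also accepts a `group` vector, which must follow the same sample order as `batch`.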

3.1.2 BRCA_combat_count

BRCA_combat_count <- combat_tumor(
    tumor_data_path = "../test_TransProR/generated_data1/BRCA_Breast_TCGA_exp_tumor.rds", 
    CombatTumor_output_path = "../test_TransProR/generated_data1/removebatch_BRCA_Breast_TCGA_exp_tumor.rds",
    auto_mode = TRUE,
    default_input = "01"
)
TumorHistologicalTypes
  01   06 
1097    7 
head(BRCA_combat_count)[1:5, 1:5]
         TCGA-A2-A0CY-01A TCGA-B6-A40B-01A TCGA-AO-A0J8-01A TCGA-A8-A08J-01A
TSPAN6               3166             3886             3878             2366
TNMD                    0               41               32                7
DPM1                 1259             2126             2146             4987
SCYL3                1123             2112             4348             2843
C1orf112              445              681             1311              506
         TCGA-E2-A14N-01A
TSPAN6               2701
TNMD                   11
DPM1                 2439
SCYL3                 936
C1orf112             1221

3.1.3 LGG_combat_count

LGG_combat_count <- combat_tumor(
    tumor_data_path = "../test_TransProR/generated_data1/LGG_Brain_TCGA_exp_tumor.rds", 
    CombatTumor_output_path = "../test_TransProR/generated_data1/removebatch_LGG_Brain_TCGA_exp_tumor.rds",
    auto_mode = TRUE,
    default_input = "01"
)
TumorHistologicalTypes
 01 
511 
head(LGG_combat_count)[1:5, 1:5]
         TCGA-VM-A8C8-01A TCGA-VV-A829-01A TCGA-DH-5141-01A TCGA-RY-A840-01A
TSPAN6               1779             2776             3708             2284
TNMD                    0                2               17                1
DPM1                 1049             1102             1119              951
SCYL3                 447              757              522              565
C1orf112              193              276              113              155
         TCGA-DB-A64V-01A
TSPAN6               2704
TNMD                    3
DPM1                 1170
SCYL3                 550
C1orf112              191

3.1.4 THCA_combat_count

THCA_combat_count <- combat_tumor(
    tumor_data_path = "../test_TransProR/generated_data1/THCA_Thyroid_TCGA_exp_tumor.rds",
    CombatTumor_output_path = "../test_TransProR/generated_data1/removebatch_THCA_Thyroid_TCGA_exp_tumor.rds",
    auto_mode = TRUE,
    default_input = "01"
)
TumorHistologicalTypes
 01  06 
502   8 
head(THCA_combat_count)[1:5, 1:5]
         TCGA-BJ-A28W-01A TCGA-EM-A1CT-01A TCGA-EL-A4JZ-01A TCGA-EL-A3CT-01A
TSPAN6               1044             1271             2474             4254
TNMD                    3                0                1                7
DPM1                 1134              715             1668             1746
SCYL3                 451              623              772              545
C1orf112              122              165              225              196
         TCGA-ET-A2MY-01A
TSPAN6               2471
TNMD                    0
DPM1                 1726
SCYL3                1048
C1orf112              240

3.2 Normal tissue from GTEx and TCGA

3.2.1 SKCM_Skin_Combat_Normal_TCGA_GTEX_count

SKCM_Skin_Combat_Normal_TCGA_GTEX_count <- Combat_Normal(
    TCGA_normal_data_path = "../test_TransProR/generated_data1/SKCM_Skin_TCGA_exp_normal.rds",
    gtex_data_path = "../test_TransProR/generated_data1/Skin_SKCM_Gtex.rds",
    CombatNormal_output_path = "../test_TransProR/generated_data1/removebatch_SKCM_Skin_Normal_TCGA_GTEX_count.rds",
    auto_mode = TRUE,
    default_input = "skip"
)
NormalHistologicalTypes
11 
 1 
head(SKCM_Skin_Combat_Normal_TCGA_GTEX_count)[1:5, 1:5]
         GTEX-111CU-1126-SM-5EGIM GTEX-111CU-1926-SM-5GZYZ
TSPAN6                        249                      827
TNMD                            8                       87
DPM1                          363                      656
SCYL3                         287                      799
C1orf112                       72                      264
         GTEX-111FC-0126-SM-5N9DL GTEX-111FC-2526-SM-5GZXU
TSPAN6                        269                      455
TNMD                          109                      177
DPM1                          822                      532
SCYL3                        1079                      614
C1orf112                      237                      191
         GTEX-111VG-0008-SM-5Q5BG
TSPAN6                        434
TNMD                            0
DPM1                         1031
SCYL3                         254
C1orf112                      275

3.2.2 BRCA_Breast_Combat_Normal_TCGA_GTEX_count

BRCA_Breast_Combat_Normal_TCGA_GTEX_count <- Combat_Normal(
    TCGA_normal_data_path = "../test_TransProR/generated_data1/BRCA_Breast_TCGA_exp_normal.rds",
    gtex_data_path = "../test_TransProR/generated_data1/Breast_BRCA_Gtex.rds",
    CombatNormal_output_path = "../test_TransProR/generated_data1/removebatch_BRCA_Breast_Normal_TCGA_GTEX_count.rds",
    auto_mode = TRUE,
    default_input = "11"
)
NormalHistologicalTypes
 11 
113 
Found 2 batches
Using null model in ComBat-seq.
Adjusting for 0 covariate(s) or covariate level(s)
Estimating dispersions
Fitting the GLM model
Shrinkage off - using GLM estimates for parameters
Adjusting the data
head(BRCA_Breast_Combat_Normal_TCGA_GTEX_count)[1:5, 1:5]
          TCGA-BH-A1F0-11B TCGA-BH-A0BZ-11A TCGA-AC-A2FM-11B TCGA-BH-A0HA-11A
5_8S_rRNA                0                0                0                0
5S_rRNA                  0                0                0                0
7SK                      0                0                0                0
A1BG                   113               41              246              175
A1BG-AS1               250              147              381              239
          TCGA-BH-A1FU-11A
5_8S_rRNA                0
5S_rRNA                  0
7SK                      0
A1BG                   147
A1BG-AS1               150
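
In this example the two batches are presumably the two data sources, TCGA normal tissue and GTEx. One plausible way to assemble such input is to keep the genes shared by both tables and label each column by its source. The sketch below uses illustrative data and hypothetical object names; whether Combat_Normal proceeds exactly this way is an assumption.

```r
# Illustrative count tables (genes x samples) standing in for the two
# sources; the GTEx values and the source-specific genes are made up.
tcga_normal <- data.frame(
  "TCGA-BH-A1F0-11B" = c(113L, 250L, 7L),
  "TCGA-BH-A0BZ-11A" = c(41L, 147L, 3L),
  row.names = c("A1BG", "A1BG-AS1", "TCGA_ONLY_GENE"),
  check.names = FALSE
)
gtex <- data.frame(
  "GTEX-111CU-1126-SM-5EGIM" = c(90L, 120L, 15L),
  "GTEX-111FC-0126-SM-5N9DL" = c(60L, 200L, 22L),
  row.names = c("A1BG-AS1", "A1BG", "GTEX_ONLY_GENE"),
  check.names = FALSE
)

# Keep genes present in both tables, in one shared (sorted) order.
common_genes <- sort(intersect(rownames(tcga_normal), rownames(gtex)))
combined <- cbind(tcga_normal[common_genes, , drop = FALSE],
                  gtex[common_genes, , drop = FALSE])

# One batch label per column, in column order.
batch <- c(rep("TCGA", ncol(tcga_normal)), rep("GTEx", ncol(gtex)))
names(batch) <- colnames(combined)

print(combined)
print(table(batch))
```

Indexing both tables by the same `common_genes` vector guarantees that the rows line up before the columns are bound together.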

3.2.3 LGG_Brain_Combat_Normal_TCGA_GTEX_count

LGG_Brain_Combat_Normal_TCGA_GTEX_count <- Combat_Normal(
    TCGA_normal_data_path = "../test_TransProR/generated_data1/LGG_Brain_TCGA_exp_normal.rds",
    gtex_data_path = "../test_TransProR/generated_data1/Brain_LGG_Gtex.rds",
    CombatNormal_output_path = "../test_TransProR/generated_data1/removebatch_LGG_Brain_Normal_TCGA_GTEX_count.rds",
    auto_mode = TRUE,
    default_input = "skip"
)
< table of extent 0 >
head(LGG_Brain_Combat_Normal_TCGA_GTEX_count)[1:5, 1:5]
         GTEX-1117F-3226-SM-5N9CT GTEX-111FC-3126-SM-5GZZ2
TSPAN6                        645                      241
TNMD                            3                        5
DPM1                          400                      527
SCYL3                         188                      421
C1orf112                      112                      158
         GTEX-111FC-3326-SM-5GZYV GTEX-1128S-2726-SM-5H12C
TSPAN6                        120                      261
TNMD                            0                        3
DPM1                          354                      456
SCYL3                         390                      256
C1orf112                      300                       94
         GTEX-1128S-2826-SM-5N9DI
TSPAN6                         57
TNMD                            0
DPM1                          305
SCYL3                         217
C1orf112                      245

3.2.4 THCA_Thyroid_Combat_Normal_TCGA_GTEX_count

THCA_Thyroid_Combat_Normal_TCGA_GTEX_count <- Combat_Normal(
    TCGA_normal_data_path = "../test_TransProR/generated_data1/THCA_Thyroid_TCGA_exp_normal.rds",
    gtex_data_path = "../test_TransProR/generated_data1/Thyroid_THCA_Gtex.rds",
    CombatNormal_output_path = "../test_TransProR/generated_data1/removebatch_THCA_Thyroid_Normal_TCGA_GTEX_count.rds",
    auto_mode = TRUE,
    default_input = "11"
)
NormalHistologicalTypes
11 
58 
Found 2 batches
Using null model in ComBat-seq.
Adjusting for 0 covariate(s) or covariate level(s)
Estimating dispersions
Fitting the GLM model
Shrinkage off - using GLM estimates for parameters
Adjusting the data
head(THCA_Thyroid_Combat_Normal_TCGA_GTEX_count)[1:5, 1:5]
          TCGA-BJ-A28W-11A TCGA-EM-A1CT-11A TCGA-EL-A3MX-11A TCGA-DO-A1JZ-11A
5_8S_rRNA                0                0                0                0
5S_rRNA                  0                0                0                0
7SK                      0                0                0                0
A1BG                   203              233              206              148
A1BG-AS1               240              323              360              324
          TCGA-EL-A3TA-11A
5_8S_rRNA                0
5S_rRNA                  0
7SK                      0
A1BG                   154
A1BG-AS1               249